This is my fourth project in the Data Analysis Nanodegree program by Udacity. In this project I used R libraries to explore the white wine quality dataset, available here; a description of the variables is provided here.

This dataset consists of physicochemical properties of wine samples, collected to understand how these properties affect the quality of the wine. Quality, in this dataset, is a sensory preference given by expert wine tasters on a scale of 0-10, where 0 indicates “bad” and 10 indicates “good”.

This project is organized to show my thought process as I analyze the different physicochemical properties, their relationships, and their effect on sensory-based quality. I included univariate, bivariate, and multivariate analysis along with different visualizations.

Loading the data set into the workspace.

# Load the Data
winedata=read.csv("wineQualityWhites.csv")

After loading the data set, I want to see its outline. For this I will use the dim, names, str, and summary functions.

# to get the dimensions of the data set
dim(winedata)
## [1] 4898   13
#to get the names of the variables
names(winedata)
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
#structure of the data set
str(winedata)
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
# to get the summary of all variables
summary(winedata)
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

As seen from the above result, this data set has 4898 rows and 13 variables. From the summary it can be seen that the first variable, “X”, is just a row index and is not required for the analysis. So, I will delete this variable before proceeding further.

The variable names are descriptive and explain which physicochemical property they represent. For almost all variables the maximum value is very high compared to the median, which indicates the presence of outliers. I want to plot histograms to see if my intuition is correct. Histograms also show how each variable is distributed.
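The outlier intuition can be checked roughly with the common 1.5 * IQR rule. The sketch below is an illustration rather than part of the original analysis: it uses only the residual.sugar quartiles and maximum copied from the summary above, and the 1.5 multiplier is a conventional assumption, not something the dataset prescribes.

```r
# Quartiles and maximum of residual.sugar, copied from the summary() output above
q1        <- 1.7
q3        <- 9.9
max_value <- 65.8

# Interquartile range and the conventional upper fence (Q3 + 1.5 * IQR)
iqr         <- q3 - q1
upper_fence <- q3 + 1.5 * iqr

# TRUE means the maximum would be flagged as an outlier under this rule
exceeds_fence <- max_value > upper_fence
print(upper_fence)
print(exceeds_fence)
```

If the maximum lies above the upper fence, at least one value would be flagged under this rule, which is consistent with what the histograms are meant to confirm.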

Quality is read as a quantitative variable rather than a qualitative one, as indicated by the mean reported in the summary. I will use it to create a new quality_factor variable as an ordered factor. Residual sugar and the two types of sulfur dioxide have high variability, as can be seen from their minimum and maximum values. Citric acid has a minimum value of 0. The pH values are all less than 4, indicating that the wine is acidic.

# deleting the "X" index variable
winedata<-winedata[,-1]

# creating an ordered factor variable representing quality
winedata$quality_factor <- factor(winedata$quality, ordered = TRUE)

# looking at the summary after the initial modifications
summary(winedata)
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##                                                                     
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##                                                            
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##                                                                   
##     quality      quality_factor
##  Min.   :3.000   3:  20        
##  1st Qu.:5.000   4: 163        
##  Median :6.000   5:1457        
##  Mean   :5.878   6:2198        
##  3rd Qu.:6.000   7: 880        
##  Max.   :9.000   8: 175        
##                  9:   5

As seen from the summary of the quality_factor variable, there are very few samples representing quality levels 3 and 9. Quality level 6 is the most represented in this sample. There are no samples representing quality levels 0, 1, 2, or the best quality, 10. I believe that the analysis of this data can be improved by grouping quality levels as “bad”, “average”, and “good”. This is reasonable because there is no objective cut-off for each quality level; they are all sensory perception results from wine tasters.

# creating a quality rating variable
# (winedata$quality is referenced directly, so attach() is not needed)
winedata$rating <- factor(
                   ifelse(winedata$quality <= 4, "bad",
                    ifelse(winedata$quality <= 6, "average", "good")),
                   levels = c("bad", "average", "good")
                   )

Samples with a quality value of 3 or 4 are given a rating of “bad”, samples with a quality of 5 or 6 a rating of “average”, and all other samples (quality 7, 8, or 9) a rating of “good”.
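The same grouping can also be expressed with cut(). This is a minimal sketch on a small hypothetical vector of quality scores, not the code used in this project; the break points 4 and 6 are taken from the grouping described above.

```r
# Hypothetical quality scores covering every level present in the data
quality <- c(3, 4, 5, 6, 7, 8, 9)

# Intervals are right-closed by default: (-Inf, 4], (4, 6], (6, Inf)
rating <- cut(quality,
              breaks = c(-Inf, 4, 6, Inf),
              labels = c("bad", "average", "good"))

print(data.frame(quality, rating))
```

Compared with nested ifelse() calls, cut() keeps the thresholds in one place, which makes them easier to change later.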

summary(winedata)
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##                                                                     
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##                                                            
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##                                                                   
##     quality      quality_factor     rating    
##  Min.   :3.000   3:  20         bad    : 183  
##  1st Qu.:5.000   4: 163         average:3655  
##  Median :6.000   5:1457         good   :1060  
##  Mean   :5.878   6:2198                       
##  3rd Qu.:6.000   7: 880                       
##  Max.   :9.000   8: 175                       
##                  9:   5

There are 1060 observations with a rating of “good”, 183 with a rating of “bad”, and the remaining 3655 with a rating of “average”.

Univariate Analysis

I want to see how the variables are distributed. Wherever the summary indicates outliers, I use axis limits to exclude some of them.

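library(ggplot2)    # provides ggplot() and the geoms used below
library(gridExtra)  # provides grid.arrange()
p1 = ggplot(winedata, aes(x=quality))+
     geom_bar()
p2 = ggplot(winedata, aes(x=rating,fill = rating))+
     geom_bar()
grid.arrange(p1,p2,ncol=2)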

As seen from the plot on the left, very few observations fall under the highest and lowest quality levels, which is why I created the new rating variable. Its distribution is shown on the right. Most of the samples fall under average quality.

ggplot(winedata, aes(x=fixed.acidity))+
  xlim(quantile(winedata$fixed.acidity,0.01),
       quantile(winedata$fixed.acidity,0.99))+
  geom_histogram(binwidth = 0.1)

From the above graph it is clear that the fixed acidity variable is approximately normally distributed. This can be useful when modelling quality with this variable.
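The xlim() calls combined with quantile() in these histograms keep roughly the middle 98% of the values and leave the extremes out of the plot. A base R sketch of the same idea follows; it runs on synthetic values rather than the wine data, and the mean, spread, and single extreme value are assumptions chosen only for illustration.

```r
set.seed(1)

# Synthetic stand-in for a roughly normal variable with one extreme value
x <- c(rnorm(1000, mean = 6.8, sd = 0.8), 14.2)

# 1st and 99th percentiles, used as lower and upper limits
lims <- quantile(x, c(0.01, 0.99))

# Keep only the values inside the limits
trimmed <- x[x >= lims[1] & x <= lims[2]]

print(lims)
print(length(x) - length(trimmed))  # number of values left out
```

Note that this only hides the extreme values from the plot; the data frame itself is unchanged, so later steps still see every observation.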

ggplot(winedata,aes(x=volatile.acidity))+
  xlim(quantile(winedata$volatile.acidity,0.01),
       quantile(winedata$volatile.acidity,0.99))+
  geom_histogram(binwidth = 0.01)

The volatile acidity distribution is right skewed. I want to see if a transformation of this variable brings it closer to a normal distribution.

ggplot(winedata,aes(x=volatile.acidity))+
  geom_histogram()+
  scale_x_log10()

As can be seen, this log transform of volatile acidity is approximately normal. I will use this transformation whenever I need to use volatile acidity.
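Why a log transform helps with right skew can be illustrated numerically. The sketch below is an assumption-laden illustration: it draws synthetic log-normal values (the parameters are hypothetical, not fitted to volatile acidity) and compares a simple moment-based skewness before and after log10(). The skewness() helper is defined here and is not a base R function.

```r
# Simple moment-based sample skewness (helper defined for this sketch)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

set.seed(42)

# Synthetic right-skewed values; meanlog and sdlog are illustrative assumptions
x <- rlnorm(5000, meanlog = log(0.26), sdlog = 0.35)

skew_raw <- skewness(x)         # clearly positive for right-skewed data
skew_log <- skewness(log10(x))  # close to zero if the log scale is symmetric

print(c(raw = skew_raw, log10 = skew_log))
```

A skewness near zero after the transform is the numeric counterpart of the histogram looking symmetric on a log scale.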

ggplot(winedata,aes(x=citric.acid))+
  xlim(quantile(winedata$citric.acid,0.01),
       quantile(winedata$citric.acid,0.99))+
  geom_histogram(binwidth = 0.01)

Citric acid is approximately normal after excluding outliers.

ggplot(winedata,aes(x=residual.sugar))+
  xlim(quantile(winedata$residual.sugar,0.01),
       quantile(winedata$residual.sugar,0.99))+
  geom_histogram(binwidth = 0.1)

The residual sugar distribution is far from normal. I would like to apply a log transform and see whether the distribution changes.

ggplot(winedata,aes(x=residual.sugar))+
  scale_x_log10()+
  geom_histogram()

Even after the log transform the distribution is not close to normal; instead it looks bimodal. Still, it seems better behaved than the untransformed variable, so I will use it in my further analysis.

ggplot(winedata,aes(x=chlorides))+
  xlim(quantile(winedata$chlorides,0.01),
       quantile(winedata$chlorides,0.96))+
  geom_histogram(binwidth = 0.005)

Most of the chloride values range from 0.009 to 0.05 g/dm^3. There are some outliers, which I excluded in this graph. If I consider this variable for further analysis, I will include all chloride values to see its effect on wine quality.

a1 = ggplot(winedata,aes(x=free.sulfur.dioxide))+
     xlim(quantile(winedata$free.sulfur.dioxide,0.01),
       quantile(winedata$free.sulfur.dioxide,0.99))+
     geom_histogram(binwidth = 1)
a2 = ggplot(winedata,aes(x=total.sulfur.dioxide))+
     xlim(quantile(winedata$total.sulfur.dioxide,0.01),
       quantile(winedata$total.sulfur.dioxide,0.99))+
     geom_histogram(binwidth = 1)
grid.arrange(a1,a2, ncol=2)

Both forms of sulfur dioxide are approximately normally distributed.

ggplot(winedata,aes(x=density))+
       xlim(quantile(winedata$density,0.001),
         quantile(winedata$density,0.999))+
       geom_histogram(binwidth = 0.00065)

The density values are all close to 1 g/cm^3, the density of water.

ggplot(winedata,aes(x=sulphates))+
       xlim(quantile(winedata$sulphates,0.01),
         quantile(winedata$sulphates,0.99))+
       geom_histogram(binwidth = 0.01)

Sulphates are approximately normally distributed, with a slight right skew. Most of the values are less than 0.6.

ggplot(winedata,aes(x=alcohol))+
       geom_histogram(binwidth = 0.1)

Alcohol content varies from 8 to 14.2, but most of the values fall between 9 and 13.

ggplot(winedata,aes(x=pH))+
       xlim(quantile(winedata$pH,0.01),
         quantile(winedata$pH,0.99))+
       geom_histogram(binwidth = 0.05)

As already noted from the summary, all these pH values indicate that the wine is acidic. Most of the values range from 3.0 to 3.4.

I now have an idea of how each individual variable is distributed.

Before proceeding further, because there are so few top-quality samples, I would like to look at all the variables for these samples.

winedata[winedata$quality == 9, ]
##      fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 775            9.1             0.27        0.45           10.6     0.035
## 821            6.6             0.36        0.29            1.6     0.021
## 828            7.4             0.24        0.36            2.0     0.031
## 877            6.9             0.36        0.34            4.2     0.018
## 1606           7.1             0.26        0.49            2.2     0.032
##      free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates
## 775                   28                  124 0.99700 3.20      0.46
## 821                   24                   85 0.98965 3.41      0.61
## 828                   27                  139 0.99055 3.28      0.48
## 877                   57                  119 0.98980 3.28      0.36
## 1606                  31                  113 0.99030 3.37      0.42
##      alcohol quality quality_factor rating
## 775     10.4       9              9   good
## 821     12.4       9              9   good
## 828     12.5       9              9   good
## 877     12.7       9              9   good
## 1606    12.9       9              9   good

For these high quality samples, most of the variables have values in the average range. One striking thing, however, is that the alcohol value for all these samples is more than 12, except for the sample in row 775. The residual sugar value for that sample is very high, 10.6, compared to the remaining four samples. This made me suspect a relationship between quality, alcohol content, and residual sugar. In my further analysis, I will look at this association more closely to find out whether the relationship actually exists or whether this sample is just an outlier.
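The observation about row 775 can be reproduced from the printed rows alone. The sketch below rebuilds a small data frame from the output above (only the row labels, residual sugar, and alcohol columns) and is meant as a quick check; with five points, the correlation it prints is a hint at most and not evidence of a relationship.

```r
# Values copied from the printed quality-9 rows above
top <- data.frame(
  row            = c(775, 821, 828, 877, 1606),
  residual.sugar = c(10.6, 1.6, 2.0, 4.2, 2.2),
  alcohol        = c(10.4, 12.4, 12.5, 12.7, 12.9)
)

# Which samples are at or below 12% alcohol?
low_alcohol <- top[top$alcohol <= 12, ]
print(low_alcohol)

# Correlation between alcohol and residual sugar within these five rows
r_top <- cor(top$alcohol, top$residual.sugar)
print(r_top)
```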

Before I attempt to create any model to predict wine quality, I want to see if any collinearity exists between the predictor variables. I strongly believe that some of the variables are correlated. I also want to find out the correlation between quality and these predictor variables.

Multicollinearity is very problematic and its identification before modelling can help us in using better predictors in the model.

Bivariate and Multivariate Analysis

library(corrplot)  # provides corrplot.mixed()
corrplot.mixed(cor(winedata[,1:12]),tl.pos = "lt")

Correlation between quality and other variables:

Apart from alcohol and density, which are discussed below, the variables have very little or no correlation with quality.

Correlation among other variables:

Residual sugar and density have a correlation of 0.84, which is a very strong relationship. Alcohol and density have a strong negative correlation of -0.78. The two forms of sulfur dioxide have a correlation of 0.62, a strong correlation, as expected. Density has a correlation of 0.53 with total sulfur dioxide. Alcohol has a negative correlation of -0.45 with both residual sugar and total sulfur dioxide. Alcohol and chlorides have a correlation of -0.36.

The most important observation is the negative correlation between residual sugar and alcohol, which matches the relationship I suspected above. This suggests that sample 775 is not just an outlier. Alcohol and density are the variables most correlated with the other variables and also with quality, which indicates that they can be good predictors in a model for quality.
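Strongly correlated pairs can also be listed programmatically instead of being read off the plot. The sketch below is illustrative only: it builds a synthetic data frame in which density depends on sugar and alcohol (the coefficients, noise level, and the 0.5 threshold are assumptions), then extracts every pair whose absolute correlation exceeds the threshold.

```r
set.seed(7)
n <- 500

# Synthetic predictors; density is constructed to depend on both of them
sugar   <- runif(n, 1, 20)
alcohol <- runif(n, 8, 14)
density <- 0.99 + 0.0004 * sugar - 0.001 * (alcohol - 10) + rnorm(n, sd = 0.0005)
df <- data.frame(sugar, alcohol, density)

# Correlation matrix, then the upper-triangle entries above the threshold
cm     <- cor(df)
strong <- which(abs(cm) > 0.5 & upper.tri(cm), arr.ind = TRUE)

pairs <- data.frame(
  var1 = rownames(cm)[strong[, "row"]],
  var2 = colnames(cm)[strong[, "col"]],
  r    = round(cm[strong], 2)
)
print(pairs)
```

Using upper.tri() avoids listing each pair twice and skips the diagonal, where every variable correlates perfectly with itself.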

Since correlation exists between some of the variables, I want to plot them to better understand their relationships. From the above plot I understood that alcohol content, residual sugar, density, chlorides, and sulfur dioxide affect quality either directly or because of their relationship with other variables. Now I will plot some of these relationships.

ggplot(winedata,aes(y=alcohol,x=quality_factor, fill = rating))+
  geom_boxplot()

From this plot it is clear that alcohol is a good predictor and that quality improves with higher alcohol content. But I want to test this intuition when other correlated variables are present. Now I’ll consider the effect of residual sugar in the presence of alcohol content.

ggplot(winedata,aes(x=alcohol, y=residual.sugar, color = rating))+
   ylim(quantile(winedata$residual.sugar,0.0),
       quantile(winedata$residual.sugar,0.99))+
  geom_point()+
  geom_vline(xintercept = 11.5, color = "red")

Most of the high quality wines have an alcohol content above 11.5, as indicated by the red vertical line. This plot is cluttered when wines of every rating are shown, so I decided to plot only the good quality wines.

wine_good = winedata[(winedata$rating=="good"),]
ggplot(wine_good,aes(x=alcohol, y=residual.sugar))+
   ylim(quantile(wine_good$residual.sugar,0.0),
       quantile(wine_good$residual.sugar,0.99))+
  geom_point()+
  geom_vline(xintercept = 9, color = "red")+
  geom_hline(yintercept = 8, color = "blue")

When the alcohol content is low (toward the red line at 9), the samples that still earn a good rating tend to have high residual sugar values, above the blue line. From the above graph, I conclude that alcohol is a very strong predictor and that residual sugar matters only when the alcohol content is low.

Now, I want to see how density affects the relationship of alcohol and quality given that alcohol and density have a negative correlation.

ggplot(winedata,aes(x = alcohol, y = density, color = rating))+
  ylim(quantile(winedata$density,0.0),
       quantile(winedata$density,0.99))+
  geom_point()+
  geom_vline(xintercept = 11, color = "black")

Again, alcohol dominates even in the presence of density. Samples with an alcohol content above 11 (black vertical line) tend to have a higher quality rating irrespective of the density value. Hence, as suggested by the correlation plot, the effect of density is largely captured by alcohol content as a predictor of quality.

Now, I want to see how total sulfur dioxide affects quality in the presence of alcohol content as it has a negative correlation with alcohol.

ggplot(winedata,aes(x = alcohol, y = total.sulfur.dioxide, color = rating))+
  geom_point()

Again, alcohol dominates. From all these graphs I conclude that alcohol is a very strong predictor.

Now I want to see how chlorides affect wine quality.

ggplot(winedata,aes(x = quality_factor, y = chlorides, fill = rating))+
  geom_boxplot()

As I previously noted, there are outliers in the chloride distribution; they can be seen in the above diagram for quality levels 5 and 6.

ggplot(winedata,aes(x = quality_factor, y = chlorides, fill = rating))+
  ylim(quantile(winedata$chlorides,0),
       quantile(winedata$chlorides,0.9))+
  geom_boxplot()

Even though this plot suggests that quality is higher for lower chloride content, I think this relationship needs further analysis, as only five samples represent quality level 9. So I will draw a boxplot using the rating variable.

ggplot(winedata,aes(x = rating, y = chlorides, fill = rating))+
  ylim(quantile(winedata$chlorides,0),
       quantile(winedata$chlorides,0.9))+
  geom_boxplot()

Plotted against the rating variable, the data still shows that lower chloride content goes with good quality samples. So chloride content also looks like a good predictor of quality. But I want to further test its effect in the presence of other variables.

ggplot(winedata,aes(x=alcohol,y=chlorides,color = rating))+
  geom_point()

In the presence of alcohol, chlorides do not appear to have any effect on quality as a predictor. This further strengthened my previous observation that alcohol is a strong predictor of quality.

Model Building

I want to build a model to predict quality based on the observations from the above plots. First I want to fit an SVM model using alcohol and density as predictors.

library(e1071)  # assumed source of svm(); the package actually loaded is not shown here
model1<-svm(rating ~ alcohol+density, winedata)
preds1<-predict(model1, winedata)

# finding accuracy
table(winedata$rating,preds1)
##          preds1
##            bad average good
##   bad        0     177    6
##   average    0    3508  147
##   good       0     825  235

accuracy = (3508+235)/4898 = 0.7642

This accuracy looks good, but I measured it on the same data I used for training, so it is an optimistic estimate rather than a reliable one. For context, always predicting “average” would already give 3655/4898 = 0.7462. Now I want to check the effect of chlorides as a predictor alongside alcohol.

model2<-svm(rating ~ alcohol+chlorides, winedata)
preds2<-predict(model2, winedata)

# finding accuracy
table(winedata$rating,preds2)
##          preds2
##            bad average good
##   bad        0     179    4
##   average    0    3496  159
##   good       0     815  245

accuracy = (3496+245)/4898 = 0.7638

This is almost the same as the previous one. This further strengthened my intuition that even though chlorides appeared to affect quality in the visualization, overall they are not that effective.

model3<-svm(rating ~ alcohol+total.sulfur.dioxide, winedata)
preds3<-predict(model3, winedata)

# finding accuracy
table(winedata$rating,preds3)
##          preds3
##            bad average good
##   bad        2     181    0
##   average    0    3515  140
##   good       0     844  216

accuracy = (2+3515+216)/4898 = 0.7621

Again, this model did not improve my accuracy. One point to remember is that I am testing my models on the same data I used to train them. This is because there are too few samples with a quality value of 9 to build a model on training data and evaluate it on separate test data.
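One way to hold out test data despite the rare classes is a stratified split, which samples the same fraction from each rating so that no class disappears from either part. The sketch below is an assumption-based illustration, not something done in this analysis: it uses only the class counts from the summary above, and the 70/30 ratio and the seed are arbitrary choices.

```r
set.seed(123)

# Rating labels rebuilt from the class counts in the summary above
rating <- factor(rep(c("bad", "average", "good"), times = c(183, 3655, 1060)),
                 levels = c("bad", "average", "good"))

# Sample about 70% of the row indices within each rating level
by_class  <- split(seq_along(rating), rating)
train_idx <- unlist(lapply(by_class, function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))
test_idx  <- setdiff(seq_along(rating), train_idx)

print(table(rating[train_idx]))
print(table(rating[test_idx]))
```

The resulting index vectors could then be used to subset the data frame into training and test parts before fitting and evaluating a model.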

Now, I want to test a model with only alcohol, as all my previous analysis indicated that quality is mostly affected by alcohol and that, in its presence as a predictor, the other variables are not that effective.

model4<-svm(rating ~ alcohol, winedata)
preds4<-predict(model4, winedata)

# finding accuracy
table(winedata$rating,preds4)
##          preds4
##            bad average good
##   bad        0     177    6
##   average    0    3478  177
##   good       0     805  255

accuracy = (3478+255)/4898 = 0.7621

This is interesting! A single predictor, alcohol content, produces almost the same accuracy as the previous models. This strengthens my previous statement that alcohol suppresses the effect of the other variables as predictors.

The best model so far is the first model, where I used alcohol along with density, although its margin over the alcohol-only model is small (0.7642 versus 0.7621).
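The accuracy figures can be re-derived from the confusion matrices printed above. The sketch below copies those counts by hand (the list names are labels chosen for this illustration) and also computes the share of the largest class as a simple baseline, which is one hedged way to judge how much the models add beyond always predicting “average”.

```r
# Confusion matrices copied from the table() outputs above
# (rows = actual rating, columns = predicted rating)
classes <- c("bad", "average", "good")
make_cm <- function(counts) {
  matrix(counts, nrow = 3, byrow = TRUE,
         dimnames = list(actual = classes, predicted = classes))
}

cms <- list(
  alcohol_density   = make_cm(c(0, 177, 6,  0, 3508, 147,  0, 825, 235)),
  alcohol_chlorides = make_cm(c(0, 179, 4,  0, 3496, 159,  0, 815, 245)),
  alcohol_total_so2 = make_cm(c(2, 181, 0,  0, 3515, 140,  0, 844, 216)),
  alcohol_only      = make_cm(c(0, 177, 6,  0, 3478, 177,  0, 805, 255))
)

# Accuracy = correctly classified samples (the diagonal) / all samples
accuracy <- sapply(cms, function(cm) sum(diag(cm)) / sum(cm))

# Baseline: always predict the most frequent actual class
baseline <- max(rowSums(cms$alcohol_density)) / sum(cms$alcohol_density)

print(round(accuracy, 4))
print(round(baseline, 4))
```

Since all four models are scored on their own training data, the gaps between them and the baseline are best read as rough indications only.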

Final Plots

Histogram of Wine Quality

First I’ll plot how quality is distributed in this dataset.

a1 = ggplot(winedata,aes(x = as.factor(quality)))+
 geom_bar(color = I('black'),fill = I('#981133')) + 
  xlab("Wine Quality")+
  ylab("Number of Samples")+
  ggtitle("Histogram of Wine Quality")

a2 = ggplot(winedata, aes(x= rating,fill = rating))+
  geom_bar()+
  ggtitle("Histogram of Wine Rating")+
  xlab("Wine Rating")+
  ylab("Number of Samples")
grid.arrange(a1,a2,ncol = 2)

This is the first and most important plot of this entire analysis. It shows that most of the samples have a quality level of 6 and that very few samples represent quality levels 3, 4, 8, and 9. The plot on the left led me to create the plot on the right, where I created a rating variable taking quality levels 7, 8, and 9 as good, 3 and 4 as bad, and the rest as average. This new variable helped me when I performed multivariate analysis.

Quality VS Alcohol Content

Now, I want to plot the relationship between quality and alcohol content.

ggplot(winedata,aes(y=alcohol,x=quality_factor, fill = rating))+
  geom_boxplot()+
  ggtitle("Variation of Quality with Alcohol Content")+
  ylab("Alcohol content (% by Vol)")+
  xlab("Quality level")

This plot is the heart of this analysis. It clearly shows the effect of alcohol on the quality level, which proved to be the most effective predictor: as the alcohol content increases, the quality level increases. Another way to read this is that the wine tasters who performed this quality check preferred wines with more alcohol. This plot is the starting point for my multivariate analysis, as many other variables are correlated with alcohol.

Quality Level with Alcohol and Density

Now I want to plot the effect of alcohol content and density together on quality, as this combination gave the best of my SVM models.

ggplot(winedata,aes(y=density,x=alcohol, color = rating))+
  geom_point()+
  ggtitle("Variation of Quality with Alcohol and Density")+
  ylim(quantile(winedata$density,0.01),
       quantile(winedata$density,0.95))+
  xlab("Alcohol content (% by Vol)")+
  ylab("Density (g / cm^3)")

This plot shows that even though density is a good predictor, its effect is small in the presence of alcohol, as samples with higher alcohol content tend to have a high rating. For samples with low alcohol content, however, density might be a good predictor, as my best model was obtained by combining alcohol and density. Including density improved my accuracy slightly compared to having only alcohol as a predictor.

Summary and reflection

In this project I learned to analyze a large data set. Even though I didn’t know anything about wine, I now understand that, in this dataset, the quality of wine depends largely on alcohol content and density. I learned to use the ggplot2 library and to read documentation whenever required to produce good visualizations.

I started with the summary of the data set and slowly built my analysis, beginning with univariate analysis. This showed me what to consider later in my analysis. I searched online to find out about the corrplot library for the correlation visual I used, as the base graphics plot was very slow. I’m now confident about analyzing even larger data sets.

Further Improvements

In this analysis, I used the same data for both training and testing my models. Cross-validation could be used to estimate the true accuracy of the models I built. This needs particular attention because there are very few samples representing quality levels 3, 4, 8, and 9.

Most of my models never predicted the “bad” rating. It would be interesting to investigate why; one likely factor is class imbalance, since only 183 of the 4898 samples in my training data have a “bad” rating.
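One common response to class imbalance is to weight each class inversely to its frequency, so that mistakes on rare classes cost more during training. The sketch below only computes such weights from the class counts in the summary above; the "balanced" formula is one convention among several. If svm() here is the e1071 implementation, the weights could presumably be passed through its class.weights argument, but that was not tried in this analysis.

```r
# Class counts copied from the rating summary above
counts <- c(bad = 183, average = 3655, good = 1060)

# "Balanced" weights: total samples / (number of classes * class count)
weights <- sum(counts) / (length(counts) * counts)

print(round(weights, 2))
```

With these weights, every class contributes the same total weight, so the rare “bad” class is no longer swamped by “average”.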

When I plotted alcohol against quality, I observed that quality level 5 has a lower average alcohol content than both quality levels 4 and 6. It would be interesting to know how this happened, given that the general trend was higher quality with increasing alcohol content.